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Abstract 

Background: The interactions among genetic factors related to diseases are called epistasis. With the availability of 
genotyped data from genome-wide association studies, it is now possible to computationally unravel epistasis 
related to the susceptibility to common complex human diseases such as asthma, diabetes, and hypertension. 
However, the difficulties of detecting epistatic interaction arose from the large number of genetic factors and the 
enormous size of possible combinations of genetic factors. Most computational methods to detect epistatic 
interactions are predictor-based methods and can not find true causal factor elements. Moreover, they are both 
time-consuming and sample-consuming. 

Results: We propose a new and fast Markov Blanket-based method, FEPI-MB (Fast EPistatic Interactions detection 
using Markov Blanket), for epistatic interactions detection. The Markov Blanket is a minimal set of variables that can 
completely shield the target variable from all other variables. Learning of Markov blankets can be used to detect 
epistatic interactions by a heuristic search for a minimal set of SNPs, which may cause the disease. Experimental 
results on both simulated data sets and a real data set demonstrate that FEPI-MB significantly outperforms other 
existing methods and is capable of finding SNPs that have a strong association with common diseases. 

Conclusions: FEPI-MB algorithm outperforms other computational methods for detection of epistatic interactions 
in terms of both the power and sample-efficiency. Moreover, compared to other Markov Blanket learning methods, 
FEPI-MB is more time-efficient and achieves a better performance. 



Background 

In recent years, the success of GWAS (genome-wide 
association studies) makes it possible to detect genetic 
factors that influence the susceptibility to particular dis- 
eases in human populations [1]. While most of GWAS 
search for one single contributing locus at a time, they 
fail to identify the combinational effect (epistasis) of 
genetic variants (i.e., single-nucleotide polymorphisms, 
or SNPs) associated with common complex diseases 
such as asthma, diabetes, and hypertension [2]. It is well 
known that epistatic interactions, not individual variant, 
are critical in unravelling genetic causes of complex 
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human diseases [3]. However, the number of possible 
combinations of SNPs in a genome is enormous, which 
is infeasible to be evaluated exhaustively by experimental 
methods. Therefore, researchers resort to computational 
methods to detect epistatic interactions based on the 
genotyped data [2,4]. 

Recently, a number of statistical methods have been 
proposed to detect epistatic interactions. Among these 
methods, the most commonly used one is logistic 
regression (LR) [5]. However, logistic regression may 
not be appropriate for epistasis due to its overfitting 
problem due to the fact that the number of parameters 
will be much larger than the available samples. To avoid 
this shortcoming, Ritchie et al, proposed MDR (multi- 
factor dimensionality reduction) [6,7], which utilizes the 
ratio of the number of cases to the number of controls 
in cells of risk table to reduce the dimensionality to one 
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and select SNP combinations that have the highest pre- 
diction performance. The process of labelling each cell 
of risk table as "high risk" or "low risk" is a process of 
estimating parameters, which may also result in the 
overfitting problems when the size of SNP combinations 
is large. Furthermore, MDR selects the SNP combina- 
tions purely by the prediction performance and thus, it 
may not find true causal factors. Park and Hastie pro- 
posed the stepwise-penalized logistic regression 
(stepPLR) to overcome the drawbacks of logistic regres- 
sion and MDR [8]. StepPLR makes some simple modifi- 
cations for standard logistic regression (LR). For 
example, stepPLR combines the LR criterion with a 
penalization of the L2-norm of the coefficients. This 
modification makes stepPLR more robust to high-order 
epistatic interactions. Despite its modifications, stepPLR 
is time-consuming when estimating parameters, which 
is one essential limitation of regression methods. BEAM 
is a Bayesian marker partition model using Markov 
Chain Monte Carlo to reach an optimal marker parti- 
tion and a new B statistic to check each marker or set 
of markers for significant associations [9]. Note that 
most statistical methods can not be applied to genome- 
wide analysis directly due to their computational com- 
plexity. The alternative approaches to parametric statis- 
tical methods are machine learning methods including 
Support Vector Machine (SVM) [10] and Random For- 
est [11]. Machine learning methods consider detecting 
epistatic interactions as a feature selection problem [12] 
and try to find the best combination of SNPs with the 
highest prediction accuracy of disease status. Therefore, 
Chen et al. test three feature selection method: RFE 
(recursive feature elimination), RFA (recursive feature 
addition), and GA (genetic algorithm) in [10] and Jiang 
et al. perform a greedy search in [11]. Like MDR, 
machine learning methods select SNPs based on classifi- 
cation/prediction accuracy and can not find true causal 
factors for disease. Moreover, machine learning-based 
methods tend to introduce many false positives because 
using more SNPs tends to produce higher classification 
accuracies. 

In this paper, we propose a new and fast Markov 
Blanket method, FEPI-MB (Fast EPistatic Interactions 
detection using Markov Blanket), to detect epistatic 
interactions. The Markov Blanket is a minimal set of 
variables, which can completely shield the target variable 
from all other variables. As shown in Figure 1, genome- 
wide association studies try to identify the /c-way inter- 
action among disease SNPs: SNP1, SNP2,...,SNPk and 
exclude all other unrelated normal SNPs (SNPk+1,..., 
SNPn). Thus, the Markov Blanket learning method is 
suitable for detection of epistatic interactions in gen- 
ome-wide case-control studies, e.g., to identify a 




Figure 1 Example of genome-wide association studies (GWAS). 

The goal of genome-wide association studies is to identify the k- 
way interaction among disease SNPs: SNP1, SNP2, SNPk. 

V J 



minimal set of SNPs which may cause the disease and 
require further experiments. Meanwhile the detected 
minimal set of causal SNPs can shield the disease from 
all other normal SNPs to decrease the false positive rate 
and reduce the cost of future validation experiments. 
Furthermore, Markov Blanket method performs a heur- 
istic search by calculating the association between vari- 
ables to avoid the time-consuming training process as in 
SVM and Random Forest. 

Some Markov Blanket methods take a divide-and- 
conquer approach that breaks the problem of identify- 
ing Markov Blanket of variable T (MB (T)) into two 
subproblems: first, identifying parents and children of 
T (PC (T)) and, second, identifying the parents of the 
children of T (spouse). The goal of epistatic interac- 
tions detection is to identify causal interacting genes 
or SNPs for some certain diseases and therefore it is a 
special application of Markov Blanket method because 
we only need to detect the parents of the target vari- 
able T (disease status labels). Our new Markov Blanket 
method makes some simplifications to adapt to this 
special condition. 

We apply the FEPI-MB algorithm to simulated data- 
sets based on four disease models and a real dataset (the 
Age-related Macular Degeneration (AMD) dataset). We 
demonstrate that the proposed method significantly out- 
performs other commonly-used methods and is capable 
of finding SNPs strongly associated with diseases. Com- 
paring to other Markov Blanket learning methods, our 
method is faster and can still achieve a better 
performance. 

Results 

Simulated data generation 

We first evaluate the proposed FEPI-MB on simulated 
data sets, which are generated from three commonly 
used two-locus epistatic models [5,9] and one three- 
locus epistatic model developed [9]. Table 1 lists the dis- 
ease odds for these four epistatic models, where a is the 
baseline effect and 6 is the genotypic effect. Assume 
that an individual has genotype g A at locus A and 
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Table 1 Four disease models 
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where /?CD|&4>£b) * s the probability that an individual 
has the disease given genotype (g A &) and p(D | g A ,g B ) 
is the probability that an individual does not have the 
disease given genotype (gA<&)- 

In Modell the odds of disease increase in a multiplica- 
tive mode both within and between two loci. For exam- 
ple, an individual with Aa at locus A has larger odds, 
which are 1 + 6 times relative to those of an individual 
who is homozygous AA; the aa homozygote has further 
increased disease odds by (1 + 6) 2 . We can also find 
similar effects on locus B. Finally the odds of disease for 
each combination of genotypes at loci A and B can be 
obtained by the product of the two within-locus effects. 
Model2 demonstrates two-locus interaction multiplica- 
tive effects because at least one disease-associated allele 
must be present at each locus to increase the odds 
beyond the baseline level. Moreover the increment of 
the disease-associated allele at loci A or B can further 



increase the disease odds by the multiplicative factor 1 
+ 0. ModeB specifies two-locus interaction threshold 
effects. Like Model 2, ModeB also requires at least one 
copy of the disease-associated alleles at both loci A and 
B. However the increment of the disease-associated 
allele does not increase the risk further. We call this as 
disease threshold effect. It means that a single copy of 
the disease-associated allele at each locus is required to 
increase odds of disease and this is the disease thresh- 
old. But after the disease threshold has already been 
met, having both copies of the disease-associated allele 
at either locus has no additional influence on disease 
odds. There are three disease loci in model 4. Some cer- 
tain genotype combinations can increase disease risk 
and there are almost no marginal effects for each dis- 
ease locus. Model 4 is more complex than Models 1, 2 
and 3. All these four models are non-additive models 
and they differ in the way that the number of disease- 
associated allele increases the odds of disease. The pre- 
valence of a disease is the proportion the total number 
of cases of the disease in the population and we assume 
that the disease prevalence is 0.1 for all these four dis- 
ease models [9]. 

To generate data, we need to determine three para- 
meters associated with each model: the marginal effect 
of each disease locus (A), the minor allele frequencies 
(MAF) of both disease loci, and the strength of linkage 
disequilibrium (LD) between the unobserved disease 
locus and a genotyped locus [5]. LD is a nonrandom 
association of alleles at different loci and is quantified 
by the squared correlation coefficient r 2 calculated from 
allele frequencies [5]. In this paper, we set A equal to 
0.3, 0.3, and 0.6 for models 1, 2, and 3, respectively. For 
model 4, we set 6 = 7 arbitrarily because there are 
almost no marginal effects in model 4. We let MAF 
take four values (0.05, 0.1, 0.2, and 0.5) and let r 2 take 
two values (0.7, 1.0) for each model. For each non-dis- 
ease marker we randomly chose its MAF from a uni- 
form distribution in [0.0. 0.5]. We first generate 50 
small datasets and each dataset contains 100 markers 
genotyped for 1,000 cases and 1,000 controls based on 
each parameter setting for each model. To test the scal- 
ability of FEPI-MB, we also generate 50 large datasets 
and each dataset contains 500 markers genotyped for 
2,000 cases and 2,000 controls using the same parameter 
setting for each model. 

Epistasis detection on simulated data 

We compare the FEPI-MB algorithm with three com- 
monly-used methods: BEAM, SVM and MDR on the 
four simulated disease models. To measure the perfor- 
mance of each method, we use "power" as the criterion 
function. Power is calculated as the fraction of 50 simu- 
lated datasets in which disease associated markers are 
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identified and demonstrate statistically significant asso- 
ciations (G 2 test values below a threshold for FEPI-MB) 
with the disease [9,11]. The BEAM software is down- 
loaded from http://www.fas.harv-ard.edu/~junliu/BEAM 
and we set the threshold of the B statistic as 0.1 [9]. For 
SVM, we use LIBSVM with a RBF kernel to detect epi- 
static interactions and the same searching strategy as 
shown in [13]. Since MDR algorithm can not be applied 
to a large dataset directly, we first reduce the number of 
SNPs to 10 by ReliefF [14], a commonly-used feature 
selection algorithm, and then MDR performs an exhaus- 
tive search for a SNP set that can maximize cross-vali- 
dation consistency and prediction accuracy. For the 
large datasets containing 500 markers genotyped for 
2,000 cases and 2,000 controls, we only compare the 
performance of FEPI-MB, BEAM and SVM because 
ReliefF [14] in MDR can not work for large datasets of 
this scale. 

We show the results on the simulated data in Figures 
2 and 3. As can be seen, FEPI-MB performs the best 
comparing to other three methods. BEAM is the second 
best. In most cases, the powers of MDR are much smal- 
ler than these of the FEPI-MB and BEAM algorithms. 
For the MDR algorithm, the poor performance may be 
due to the use of ReliefF to reduce SNPs from a very 
large dimensionality. We try another comparison experi- 
ment based on the simulated data containing only 40 
markers, which makes us be able to apply MDR to the 
simulated data directly. The performance of MDR is still 
poor and this indicates that perhaps using the risk table 
as a predictor to detect epistatic interactions is not a 
good choice. In some cases, SVM can achieve a compar- 
able or even better performance than FEPI-MB and 
BEAM, however, at the cost of introducing more false 
positives. Figure 3 also demonstrates the scalability of 
FEPI-MB on large datasets. 

An important issue for epistatic interaction detection 
in genome-wide association studies is the number of 
available samples. Typically, the size of samples is lim- 
ited and consequently, computational model behaves 
differently. We explore the effect of the number of sam- 
ples on the performance of BEAM and FEPI-MB (SVM 
will always introduce a large number of false positives 
and thus, is not compared here). We generate synthetic 
datasets containing 40 markers genotyped for different 
number of cases and controls with r 2 = 1 and MAF=0.5. 
The result is shown in Figure 4 and we find that FEPI- 
MB can achieve a higher power than BEAM when the 
number of samples is the same in most cases. On the 
other hand, FEPI-MB needs fewer samples to reach the 
perfect power comparing to BEAM. So we can conclude 
that FEPI-MB is more sample-efficient than BEAM. 

We also compare the performance of FEPI-MB with 
interlAMBnPC based on the large dataset from modell 



Model! (A,=0.3 1^=0.7) 



IFEPI-MB 

□ BEAM 

□SVM 

□ MDR 



■□n In. 



Model! (1=0.3 1^=1) 



£0.6 



IFEPI-MB 

□ BEAM 
□SVM 

□ MDR 



Model2 (A,=0.3 1^=0.7) 



IFEPI-MB 

□ BEAM 
□SVM 

□ MDR 



JdL 



Model2 (A,=0.3 ?=\) 



IFEPI-MB 

□ BEAM 
□SVM 

□ MDR 



sl l n n 



IFEPI-MB 

□ BEAM 

□ SVM 

□ MDR 



Model3 {1=0.6 ?=\) 



Ei 0 - 6 

5 



ModeW (9=7 r 2 =0.7) 



■ FEPI-MB 

□ BEAM 

□ SVM 

■ MDR 



0.05 0.1 0.2 0.5 

MAF 

Model3 (X=0.6 1^=0.7) 





ModeW (9=7 r 2 =1) 



■ FEPI-MB 

□ BEAM 
□SVM 

■ MDR 



- !□! 



Figure 2 Performance comparison for small datasets containing 100 
markers genotyped from 1000 cases and 1000 controls. 



to show the time efficiency of FEPI-MB. Among the 
three variants of IAMB, interlAMBnPC can achieve the 
best performance [15]. Both FEPI-MB and inter- 
lAMBnPC are written in MATLAB and all the 
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Figure 3 Performance comparison for large datasets containing 500 
markers genotyped from 2000 cases and 2000 controls. 



experiments are run on an Intel Core 2 Duo T6600 2.20 
GHz, 4GB RAM and Windows Vista. The results are 
shown in Table 2. As seen, FEPI-MB runs more than 
ten times faster than interlAMBnPC. 



Epistasis detection on AMD data 

FEPI-MB demonstrates its greater power, sample-effi- 
ciency, and time-efficiency on simulated data with the 
number of SNPs less than 500. In practical problems, a 
typical GWAS genotype dataset contains at least more 
than 30,000 common SNPs. FEPI-MB can also be scal- 
able to large-scale datasets in real genome-wide case- 
control studies. We apply FEPI-MB to an Age-related 
Macular Degeneration (AMD) dataset, which contains 
116,204 SNPs genotyped with 96 cases and 50 controls 
[16]. AMD (OMIM 603075) [17] is a common genetic 
disease related to the progressive visual dysfunction in 
age over 70 in the developed country. We use the same 
preprocessing method as in [9,16]. After filtering, there 
are 97,327 SNPs lying in 22 autosomal chromosomes 
remained. 

The searching time of FEPI-MB for AMD-related 
SNPs on an Intel Core 2 Duo T6600 2.20 GHz, 4GB 
RAM and Windows Vista is 96.4s and FEPI-MB detects 
two associated SNPs: rs380390 and rs2402053, which 
have a G 2 test p-value of 5.36*10~ 10 . The first SNP, 
rs380390, is already found in [16] with a significant 
association with AMD. The other SNP detected by the 
FEPI-MB algorithm is SNP rs2402053, which is inter- 
genic between TFEC and TES in chromosome 7q31 
[18]. 

It is worth noting that several lines of evidence have 
previously shown the long arm of 7q harbors genes 
implicated in retinal disorders. Among which is map- 
ping of a locus on 7q31-q32 for retinitis pigmentosa, 
another retinal disease [19]. Ocular abnormalities have 
been reported for an individual with terminal deletion 
of chromosome 7q [20]. Mutations in a gene located at 
7q32 have been reported in families with autosomal 
dominant retinitis pigmentosa [21]. More recently, 
Next-generation sequencing revealed mutations in 
another gene located on chromosome 7q31 in patients 
with a form of retinopathy [22]. 

The rs2402053 SNP identified in our study does not 
locate in any of the previously reported implicated genes 
in retinal disorders. Therefore, it may shed light on dis- 
covering a new genetic factor, on chromosome 7q, con- 
tributing to the underlying mechanism of AMD, a 
complex form of retinal degenerative disorder. The real 
mechanism of interaction between rs380390 and 
rs2402053 should be explored further by biological 
experiments. 

Conclusions 

While many computational methods were used for iden- 
tification of epistatic interactions, most existing compu- 
tational methods do not consider the complexity of 
genetic mechanisms causing common diseases and only 
focus on the selection of SNP sets, which show the best 
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Figure 4 Effect of number of samples on the performance of FEPI-MB and BEAM. 



classification capacity. This will introduce many false 
positives inevitably. Furthermore, most existing methods 
cannot directly handle genome-wide scale problems. In 
this paper, we introduce a new and fast Markov 



Blanket-based method, FEPI-MB, to identify epistatic 
interactions. We compared FEPI-MB with three other 
methods, BEAM, SVM and MDR, over both simulated 
datasets and a real dataset. Our results show that the 
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Table 2 Comparison of performance of FEPI-MB and 
interlAMBnPC for the large datasets of Model 1 

Model A r 2 MAF Algorithm Power Average time (s) 



1 0.3 0.7 0.05 FEPI-MB 3 0.4574 

interlAMBnPC 3 7.5505 

0.1 FEPI-MB 6 0.4437 

interlAMBnPC 5 9.2449 

0.2 FEPI-MB 20 0.4436 

interlAMBnPC 20 9.4295 

0.5 FEPI-MB 42 0.4449 

interlAMBnPC 42 8.2823 

1 0.05 FEPI-MB 2 0.4393 

interlAMBnPC 2 7.3610 

0.1 FEPI-MB 12 0.4421 

interlAMBnPC 12 9.7156 

0.2 FEPI-MB 39 0.4431 

interlAMBnPC 38 9.6498 

0.5 FEPI-MB 45 0.4449 

interlAMBnPC 43 9.1229 



FEPI-MB algorithm outperforms other methods in 
terms of the power and sample-efficiency. Moreover, we 
compare FEPI-MB with one of the best Markov Blanket 
learning method, interlAMBnPC. The FEPI-MB is more 
than ten times faster than interlAMBnPC. 

Methods 

Markov blankets 

Bayesian networks represent a joint probability distribu- 
tion / over a set of random variables by a directed acyc- 
lic graph (DAG) G and encode the Markov condition 
property: each variable is conditionally independent of 
its nondescendants, given its parents in G[23]. In a 
Bayesian network, if the probability distribution of X 
conditioned on both Y and Z is equal to the probability 
distribution of X conditioned only on Y, i.e., P(X\ Y, Z) = 
(X | Y), X is conditionally independent of Z given Y. 
This conditional independence is represented as (X _L Z\ 
Y). 

Definition 1 (Faithfulness). A Bayesian network N 
and a joint probability distribution J are faithful to each 
other if and only if every conditional independence 
entailed by the DAG ofN and the Markov Condition is 
also present in J[24], 

Theorem 1. If a Bayesian network N is faithful to a 
joint probability distribution /, then: (I) nodes X and Y 
are adjacent in N if and only if X and Y are condition- 
ally dependent given any other set of nodes. (2) for the 
triplet of nodes X, Y , and Z in N, X and Z are adjacent 
to Y , but Z is not adjacent to X, X — > Y SI Z is a sub- 
graph ofN if and only if X and Z are dependent condi- 
tioned on every other set of nodes that contains Y . 



We can define the Markov Blanket of a variable T, 
MB (T), as a minimal set for which (X _L T\MB(T)), for 
all X g V - {7} - MB(T) where V is the variable set. 
The Markov Blanket of a variable T is a minimal set of 
variables, which can completely shield variable T from 
all other variables. All other variables are probabilisti- 
cally independent of the variable T conditioned on the 
Markov Blanket of variable T. 

Theorem 2. If Bayesian network N is faithful to its 
corresponding joint probability distribution /, then for 
every variable T, MB(T) is unique and is the set of par- 
ents, children, and spouses of T. 

Theorem 1 and Theorem 2 are proven in [25,26], 
separately. We show an example of the Markov Blanket 
in the well-known Asia network in Figure 5. The MB(T) 
of the node 'TBorCancer is the set of gray-filled nodes. 

Given the definition of a Markov Blanket, the prob- 
ability distribution of T is completely determined by the 
values of variables in MB(T). Therefore, the detection of 
Markov Blanket can be applied for optimal variable 
selection and causal discovery. In this paper, we use 
Markov Blanket method to detect potential causal SNPs 
for common complex diseases. 

Markov blankets learning methods 

There are several Markov Blanket learning methods 
such as: Koller-Sahami (KS) algorithm [27], Grow- 
Shrink (GS) algorithm [28], Incremental association 
Markov Blanket (IAMB) algorithm [15], Max-Min Mar- 
kov Blanket (MMMB) algorithm [29], HITON_MB [30] 
and PCMB [31]. 




Figure 5 The Aisa network. The gray-filled nodes are the MB(T) of 
node TBorCancer'. 

k ) 
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Koller-Sahami (KS) algorithm is the first algorithm to 
employ Markov Blanket for feature selection. However, 
there is no theoretical guarantee for Koller-Sahami (KS) 
algorithm to find optimal MB set [27]. The GS algo- 
rithm [24] and IAMB methods [15] are two similar algo- 
rithms with two search procedures, forward and 
backward. In the forward phase, the nodes of MB(T) are 
admitted into MB, while in the backward phase false 
positives are removed from MB. Under the assumptions 
of faithfulness and correct independence test, both the 
GS algorithm and IAMB are proved correct [15]. Com- 
paring to GS algorithm, IAMB might achieve a better 
performance with fewer false positives admitted during 
the forward phase. A common limitation for GS algo- 
rithm and IAMB is that both methods require a very 
large number of samples to perform well. IAMB can be 
revised in two ways: (1) After each admission step in 
forward phase, perform a backward conditioning phase 
to remove false positives to keep the size of MB(T) as 
small as possible. (2) Substitute the backward condition- 
ing phase with the PC algorithm instead [20]. In other 
words, the backward phase will perform the indepen- 
dence test conditioned on all subsets of the current 
Markov Blanket. Tsamardinos et al. proposed three 
IAMB variants: interlAMB, IAMBnPC and Inter- 
IAMBnPC [15]. They also proved the correctness of 
InterlAMBnPC. The time complexity of IAMB is 0(| 
MB|xN) where |MB| is the size of MB and N is number 
of variables. 

To overcome the data inefficient problem of IAMB 
and its variants, Max-Min Markov Blanket (MMMB) 
algorithm [29], HITON_MB [30] and PCMB [31] are 
proposed. All these three algorithms take a divide-and- 
conquer method that breaks down the problem of iden- 
tifying Markov Blanket of variable T into two subpro- 
blems: First, identifying parents and children of T (PC 
(T)) and, second, identifying the spouses of T. Mean- 
while, they have the same two assumptions as IAMB (i. 
e. faithfulness and correct independence test) and take 
into account the graph topology to improve data effi- 
ciency. However, results from MMPC/MB and HITON- 
PC/MB are not always correct since some descendants 
of T other than its children will enter PC(T) during the 
first step of identifying parents and children of T [31]. 
PCMB can be proved correct in [31]. In every loop, 
PCMB first remove unrelated variables, then PCMB use 
IAMBnPC method to admit one feature and remove 
false positives. The problem of PCMB is that the PC 
algorithm performs an exhaustive conditional indepen- 
dence test, which is very time consuming. The reason 
that PC algorithm was used in PCMB and inter- 
IAMBnPC is that PC algorithm is a more sample-effi- 
cient method and is sound under the assumption of 
faithfulness [15]. In fact if the size of Markov Blanket is 



large, PC algorithm still needs a lot of samples to guar- 
antee its performance. There is no theoretical proof and 
guarantee that the PC algorithm admits less false posi- 
tives than other methods. 

Method description: FEPI-MB 

Detecting gene-gene interaction is a special application 
of Markov Blanket learning method because we only 
need to detect the parents of the target variable T and 
don't need to design a complex algorithm to detect 
spouses of T. Here target variable T is the disease status 
labels and the parents of T are those disease SNPs. MB 
(T) only contains the parents of T. 

All Markov Blanket learning methods are based on the 
following two Theorems. 

Theorem 3. If a variable belongs to MB(T) which only 
contains the parents of T, then it will be dependent on T 
given any subset of the variable set V-{T} . 

Proof: This is a direct consequence of Theorem 1 
because now MB(T) only contains the parents of T. D 

Theorem 4. If a variable is not a member of 'MB(T), 
then conditioned on MB(T), or any superset of MB(T), it 
will be independent of T. 

Proof: Let X, Y, Z and W represent four mutually dis- 
joint variable sets. Any probability distribution p satisfies 
the weak union property: x ±{Y{jw)\z ^ x ±Y{z{jw) [25] . 
Based on the definition of Markov Blanket, we get that 
X £ MB(T) => (X 1 T | MB(T)) . Thus, by the weak 
union property, we have (X _L T | (MB(T) U S) for any 
subset S e V - {T} - {X} - MB(T) .□ 

The G 2 test is used to test independence and condi- 
tional independence between two variables for discrete 
data [13,24,32]. The null hypothesis for G 2 test is that 
two variables are independent. As described next, the 
proposed FEPI-MB uses G 2 to test the association and 
independence between SNPs and disease status. 

The detail of our FEPI-MB algorithm is shown in 
Figure 6. It consists of three phases: Remove-MB, 
Forward-MB and Backward-MB. During the phase of 
Remove-MB, unrelated variables are removed from the 
candidate set for Markov Blanket (canMB) based on the 
conditional independence test. This will reduce the 
searching space after each iteration and can help to 
decrease the computational complexity. After the phase 
of Remove-MB, the variable which has the maximal G 2 
score and is associated with the target variable T in 
canMB enters MB(T) in the phase of Forward-MB, 
where false positives are removed during the phase of 
Backward-MB. Comparing to PCMB, we get rid of the 
time-consuming PC algorithm and use the maximal sub- 
set of current MB(T) to perform the conditional inde- 
pendence test in the phase of Backward-MB. The time 
complexity of FEPI-MB is less than the 0(|MB|xN) of 
IAMB because in each iteration after the first iteration 
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^Initialization*/ 

V : set of all variables; T: Target variables; 

MB(T)=^; 

canMB=V-{T}; 

/*our algorithm*/ 

Begin procedure 

Repeat 

Remove-MB; 

Forward-MB; 

Backward-MB; 
Until MB(T) has not changed; 
End procedure 
/^Remove phase*/ 
Begin Remove-MB 

For all x i e canMB; 
g(x i ) = G\x i :T\MB(T)) 
If (x. ± T | MB{T)) 

canMB = canMB - x. ; 
End If 
End For 
End 

/^Forward phase*/ 
Begin Forward-MB 

X = argmax(g(x.)) x. e canMB ; 
If (X 1T\ MB(T)) 

MB(T) = MB(T){J{X}; 

canMB = canMB - X 
End If 

End 

/*Backward phase*/ 
Begin Backward-MB 
For all Y g MB(T) 
If (Y 1T\MB(T)-Y) 

MB(T) = MB(T) - {Y} ; 
End If 
End For 
End 



Figure 6 FEPI-MB algorithm. 



the number of conditional independence tests per- 
formed in the phase of Remove-MB is less than N. The 
optimal time complexity of FEPI-MB is O(N). 

Like IAMB and PCMB, the soundness of FEPI-MB is 
based on the assumptions of DAG -faithfulness and cor- 
rect independence test. 

Theorem 5. Under the assumptions that the indepen- 
dence tests are correct and that the data D are 



generated from a probability distribution which is faith- 
ful to a DAG G, FEPI-MB returns all parents of T. 

Proof: First, each node in MB(T) enters MB(T) in the 
Forward-MB phase and will not be removed during the 
Backward-MB phase because if X i e MB(T) , then 
[X i JLT\Z) for any Z c V - {T} - {XJ owing to The- 
orem 3. Second, the nodes outside the MB(T) will be 
removed sooner or later during the Backward-MB phase 
especially after all elements in the Markov Blanket of T 
enter the current MB(T) because of the definition of 
Markov Blanket and Theorem 4. □ 

Even though FEPI-MB is a method based on the 
greedy algorithm, Theorem 3 and Theorem 4 can guar- 
antee that FEPI-MB will not get stuck in a local 
optimum. 
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